Notes:
#Setting styling of plots to the same as in the instructor videos
#theme_set(theme_minimal(24))
#Overriding with my own preference
#theme_set(theme_minimal(14))
#Remember to set correct working directory first
pf <- read.delim("pseudo_facebook.tsv")
names(pf)
## [1] "userid" "age"
## [3] "dob_day" "dob_year"
## [5] "dob_month" "gender"
## [7] "tenure" "friend_count"
## [9] "friendships_initiated" "likes"
## [11] "likes_received" "mobile_likes"
## [13] "mobile_likes_received" "www_likes"
## [15] "www_likes_received"
summary(pf)
## userid age dob_day dob_year
## Min. :1000008 Min. : 13.00 Min. : 1.00 Min. :1900
## 1st Qu.:1298806 1st Qu.: 20.00 1st Qu.: 7.00 1st Qu.:1963
## Median :1596148 Median : 28.00 Median :14.00 Median :1985
## Mean :1597045 Mean : 37.28 Mean :14.53 Mean :1976
## 3rd Qu.:1895744 3rd Qu.: 50.00 3rd Qu.:22.00 3rd Qu.:1993
## Max. :2193542 Max. :113.00 Max. :31.00 Max. :2000
##
## dob_month gender tenure friend_count
## Min. : 1.000 female:40254 Min. : 0.0 Min. : 0.0
## 1st Qu.: 3.000 male :58574 1st Qu.: 226.0 1st Qu.: 31.0
## Median : 6.000 NA's : 175 Median : 412.0 Median : 82.0
## Mean : 6.283 Mean : 537.9 Mean : 196.4
## 3rd Qu.: 9.000 3rd Qu.: 675.0 3rd Qu.: 206.0
## Max. :12.000 Max. :3139.0 Max. :4923.0
## NA's :2
## friendships_initiated likes likes_received
## Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 17.0 1st Qu.: 1.0 1st Qu.: 1.0
## Median : 46.0 Median : 11.0 Median : 8.0
## Mean : 107.5 Mean : 156.1 Mean : 142.7
## 3rd Qu.: 117.0 3rd Qu.: 81.0 3rd Qu.: 59.0
## Max. :4144.0 Max. :25111.0 Max. :261197.0
##
## mobile_likes mobile_likes_received www_likes
## Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 4.0 Median : 4.00 Median : 0.00
## Mean : 106.1 Mean : 84.12 Mean : 49.96
## 3rd Qu.: 46.0 3rd Qu.: 33.00 3rd Qu.: 7.00
## Max. :25111.0 Max. :138561.00 Max. :14865.00
##
## www_likes_received
## Min. : 0.00
## 1st Qu.: 0.00
## Median : 2.00
## Mean : 58.57
## 3rd Qu.: 20.00
## Max. :129953.00
##
Notes:
#install.packages('ggplot2')
library(ggplot2)
qplot(x = dob_day, data = pf, binwidth = 1) +
#Setting bins to be 1 for each day of the month
scale_x_continuous(breaks=1:31)
#Also possible with ggplot()
ggplot(aes(x = dob_day), data = pf) +
geom_histogram(binwidth = 1) +
scale_x_continuous(breaks = 1:31)
Response: I notice that a disproportionate number of users have birthdays on the first day of the month. I suspect this is due to incorrect information entered by the user: the easiest way to fill out date information in a form is to leave the day at 1.
Fewer users have birthdays on day 31 than on other days, which makes sense as only 7 out of 12 months in a year have 31 days.
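The 7-out-of-12 figure can be checked with a small base R sketch. It does not touch the `pf` data; it only counts how often each day of the month occurs in a calendar year (2013 is an arbitrary non-leap year chosen for illustration).

```r
# Every date in an arbitrary non-leap year (2013 is an assumption for illustration)
days <- seq(as.Date("2013-01-01"), as.Date("2013-12-31"), by = "day")

# Day of the month (1..31) for each date
day_of_month <- as.integer(format(days, "%d"))

# How many times each day of the month occurs in the year
day_counts <- table(day_of_month)
print(day_counts[c("1", "29", "30", "31")])
```

With uniformly distributed birthdays, day 31 should therefore appear roughly 7/12 as often as day 1, so a spike on day 1 far beyond that ratio points at something other than the calendar.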
Notes: There’s a mismatch between people’s perception of the audience size of their own Facebook posts and the actual audience size.
Notes:
Response: 60
Response: 15%
Notes: Moira says that people dramatically underestimated the size of their audience. Their guesses were only about 25% of the actual audience size.
Notes:
qplot(x = dob_day, data = pf, binwidth = 1) +
scale_x_continuous(breaks = 1:31) +
facet_wrap(~dob_month, ncol = 3)
Response: My previous suspicion is consistent with what we see here: almost all of the users who selected day 1 of the month also selected month 1, which indicates incorrect user input.
Notes: You have to consider anomalies and outliers in the context of your data.
Notes: Which case do you think applies to Moira’s outlier?
Response:
Notes:
qplot(x = friend_count, data = pf)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#Experimenting to make it better
qplot(x = friend_count, data = subset(pf, friend_count < 1000), binwidth = 10)
Response: Some outliers have close to 5000 friends, which makes it hard to distinguish the finer differences among the majority of users, who have fewer than 1000 friends.
Long-tail data.
Notes:
qplot(x = friend_count, data = pf, xlim = c(0, 1000))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2951 rows containing non-finite values (stat_bin).
#Same plot, different method
qplot(x = friend_count, data = pf) +
scale_x_continuous(limits = c(0, 1000))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2951 rows containing non-finite values (stat_bin).
Notes:
qplot(x = friend_count, data = pf, binwidth = 25) +
scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50))
## Warning: Removed 2951 rows containing non-finite values (stat_bin).
#Equivalent ggplot syntax:
ggplot(aes(x = friend_count), data = pf) +
geom_histogram(binwidth = 25) +
scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50))
## Warning: Removed 2951 rows containing non-finite values (stat_bin).
# What code would you add to facet the histogram by gender?
# Add it to the code below.
qplot(x = friend_count, data = pf, binwidth = 10) +
scale_x_continuous(limits = c(0, 1000),
breaks = seq(0, 1000, 50)) +
facet_wrap(~gender, ncol = 1, strip.position = "bottom")
## Warning: Removed 2951 rows containing non-finite values (stat_bin).
In the alternate solution below, the dot in the formula for facet_grid() is a placeholder meaning that no variable is used for faceting on that dimension. In gender ~ ., gender defines the rows and there is no column variable, so the data is split by gender into three histograms (female, male and NA), each in its own row.
qplot(x = friend_count, data = pf) +
facet_grid(gender ~ .)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#my version
qplot(x = friend_count, data = pf, binwidth = 10) +
scale_x_continuous(limits = c(0, 1000),
breaks = seq(0, 1000, 50)) +
facet_grid(gender ~ .)
## Warning: Removed 2951 rows containing non-finite values (stat_bin).
#Equivalent ggplot syntax:
ggplot(aes(x = friend_count), data = pf) +
geom_histogram() +
scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) +
facet_wrap(~gender, ncol = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2951 rows containing non-finite values (stat_bin).
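To make the role of the dot in a `facet_grid()` formula concrete, here is a minimal sketch on a made-up data frame (`toy` is hypothetical and stands in for `pf`). It assumes ggplot2 is installed. The variable on the left of `~` defines the rows, the one on the right defines the columns, and `.` leaves that dimension unfaceted.

```r
library(ggplot2)

# Hypothetical toy data, not the pf data set
toy <- data.frame(
  friend_count = c(5, 20, 40, 60, 15, 30, 80, 120),
  gender = rep(c("female", "male"), each = 4)
)

base_plot <- ggplot(toy, aes(x = friend_count)) +
  geom_histogram(binwidth = 20)

# gender on the left of ~ : one row of panels per gender, no column variable
p_rows <- base_plot + facet_grid(gender ~ .)

# gender on the right of ~ : one column of panels per gender, no row variable
p_cols <- base_plot + facet_grid(. ~ gender)

print(p_rows)
print(p_cols)
```

Stacking the panels in rows keeps the x-axes aligned, which makes it easier to compare where the two distributions peak.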
Notes:
qplot(x = friend_count, data = subset(pf, !is.na(gender)), binwidth = 10) +
scale_x_continuous(limits = c(0, 1000),
breaks = seq(0, 1000, 50)) +
facet_wrap(~gender, strip.position = "bottom")
## Warning: Removed 2949 rows containing non-finite values (stat_bin).
#Equivalent ggplot syntax:
ggplot(aes(x = friend_count), data = subset(pf, !is.na(gender))) +
geom_histogram(binwidth = 10) +
scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) +
facet_wrap(~gender)
## Warning: Removed 2949 rows containing non-finite values (stat_bin).
Notes:
table(pf$gender)
##
## female male
## 40254 58574
by(pf$friend_count, pf$gender, summary)
## pf$gender: female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 37 96 242 244 4923
## --------------------------------------------------------
## pf$gender: male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 27 74 165 182 4917
Response: women
Response: 22
Response: To keep extreme outliers from having too large an impact.
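A small base R sketch of why the median is preferred here. The numbers are invented for illustration and are not taken from `pf`: adding a single extreme value moves the mean a lot and the median only slightly.

```r
# Hypothetical friend counts for five ordinary users
friends <- c(20, 35, 50, 80, 120)

# The same users plus one extreme outlier
friends_with_outlier <- c(friends, 4900)

mean_before <- mean(friends)                    # 61
median_before <- median(friends)                # 50
mean_after <- mean(friends_with_outlier)        # 867.5
median_after <- median(friends_with_outlier)    # 65

print(c(mean_before = mean_before, mean_after = mean_after))
print(c(median_before = median_before, median_after = median_after))
```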
Notes:
qplot(x = tenure, data = pf, binwidth = 30,
color = I('black'), fill = I('#099DD9'))
## Warning: Removed 2 rows containing non-finite values (stat_bin).
#Equivalent ggplot syntax:
ggplot(aes(x = tenure), data = pf) +
geom_histogram(binwidth = 30, color = 'black', fill = '#099DD9')
## Warning: Removed 2 rows containing non-finite values (stat_bin).
qplot(x = (tenure/365), data = pf, binwidth = .25,
color = I('black'), fill = I('#099009') ) +
scale_x_continuous(breaks = seq(0, 7, 1), lim = c(0, 7) )
## Warning: Removed 26 rows containing non-finite values (stat_bin).
ggplot(aes(x = tenure/365), data = pf) +
geom_histogram(binwidth = .25, color = 'black', fill = '#F79420') +
scale_x_continuous(breaks = seq(1, 7, 1), lim = c(0, 7) )
## Warning: Removed 26 rows containing non-finite values (stat_bin).
Notes:
qplot(x = (tenure/365), data = pf, binwidth = .25,
color = I('black'), fill = I('#099009'),
xlab = 'Number of years using Facebook',
ylab = 'Number of users in sample') +
scale_x_continuous(breaks = seq(0, 7, 1), lim = c(0, 7) )
## Warning: Removed 26 rows containing non-finite values (stat_bin).
#Equivalent ggplot syntax:
ggplot(aes(x = tenure / 365), data = pf) +
geom_histogram(color = 'black', fill = '#F79420') +
scale_x_continuous(breaks = seq(1, 7, 1), limits = c(0, 7)) +
xlab('Number of years using Facebook') +
ylab('Number of users in sample')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 26 rows containing non-finite values (stat_bin).
Notes:
#color and fill go inside qplot() via I(); adding a separate geom_histogram() layer would draw a second histogram with the default 30 bins
qplot(x = age, data = pf, binwidth = 1,
color = I('black'), fill = I('#099009'))
ggplot(aes(x = age), data = pf) +
geom_histogram(color = 'black', fill = '#099009', binwidth = 1) +
scale_x_continuous(breaks = seq(10, 120, 2), lim = c(10, 120))
#From course: Equivalent ggplot syntax:
ggplot(aes(x = age), data = pf) +
geom_histogram(binwidth = 1, fill = '#5760AB') +
scale_x_continuous(breaks = seq(0, 113, 5))
Response: Strange outliers: far too many users are listed as 102 or 108 years old. There is also an odd dip at ages 22 and 24, while 21, 23 and 25 are higher. The mode of the sample is 18, with roughly 5100 users, followed by 19 and 23 with roughly 4450 users each. Above the minimum age, the age distribution is right-skewed, with a long tail toward older ages. No users are under 13, which is due to legal requirements.
Notes:
summary(pf$friend_count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 31.0 82.0 196.4 206.0 4923.0
summary(log10(pf$friend_count + 1)) #+1 because log10(0) is -Inf for users with 0 friends
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.505 1.919 1.868 2.316 3.692
summary(sqrt(pf$friend_count))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 5.568 9.055 11.090 14.350 70.160
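The reason for the `+ 1` can be seen in a short base R sketch. The vector below is made up for illustration: `log10(0)` evaluates to `-Inf`, while shifting by one maps a count of 0 to exactly 0 on the log scale.

```r
# Hypothetical friend counts, including a user with no friends
counts <- c(0, 9, 99, 999)

# Without the shift, the zero becomes -Inf
raw_log <- log10(counts)

# With the shift, every value is finite: log10(1), log10(10), log10(100), log10(1000)
shifted_log <- log10(counts + 1)

print(raw_log)
print(shifted_log)
```

The square root needs no such shift, because `sqrt(0)` is simply 0.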
Notes:
#install.packages("gridExtra")
library(gridExtra)
p1 <- qplot(x = friend_count, data = pf, binwidth=10)
p2 <- qplot(x = friend_count+1, data = pf) +
scale_x_log10() +
xlab("Friend count, logarithmic scale")
p3 <- qplot(x = friend_count, data = pf) +
scale_x_sqrt() +
xlab("Friend count, square root scale")
#Alternative square root plot
p4 <- qplot(x = sqrt(friend_count), data = pf)
#Alternative log plot
p5 <- qplot(x = log10(friend_count+1), data = pf)
grid.arrange(p1, p2, p3, ncol = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#Alternate solution in ggplot
p1 <- ggplot(aes(x=friend_count), data = pf) +
geom_histogram()
p2 <- p1 + scale_x_log10()
p3 <- p1 + scale_x_sqrt()
grid.arrange(p1, p2, p3, ncol = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Transformation introduced infinite values in continuous x-axis
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1962 rows containing non-finite values (stat_bin).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Frequency polygons are good for comparing two or more distributions at once.
#Without frequency polygons: faceted histograms
qplot(x = friend_count, data = subset(pf, !is.na(gender)),
binwidth = 10) +
scale_x_continuous(lim = c(0, 1000), breaks = seq(0, 1000, 50) ) +
facet_wrap(~gender)
## Warning: Removed 2949 rows containing non-finite values (stat_bin).
#With frequency polygons: both genders overlaid in one plot
qplot(x = friend_count, data = subset(pf, !is.na(gender)),
binwidth = 10, geom = 'freqpoly', color = gender) +
scale_x_continuous(lim = c(0, 1000), breaks = seq(0, 1000, 50) )
## Warning: Removed 2949 rows containing non-finite values (stat_bin).
## Warning: Removed 4 rows containing missing values (geom_path).
#Using proportions instead of raw count
qplot(x = friend_count, y = ..count../sum(..count..),
data = subset(pf, !is.na(gender)),
xlab = 'Friend Count',
ylab = 'Proportions of users with that friend count',
binwidth = 10, geom = 'freqpoly', color = gender) +
scale_x_continuous(lim = c(0, 1000), breaks = seq(0, 1000, 50) )
## Warning: Removed 2949 rows containing non-finite values (stat_bin).
## Warning: Removed 4 rows containing missing values (geom_path).
#Equivalent ggplot syntax:
ggplot(aes(x = friend_count, y = ..count../sum(..count..)), data = subset(pf, !is.na(gender))) +
geom_freqpoly(aes(color = gender), binwidth=10) +
scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) +
xlab('Friend Count') +
ylab('Proportion of users with that friend count')
## Warning: Removed 2949 rows containing non-finite values (stat_bin).
## Warning: Removed 4 rows containing missing values (geom_path).
#More informative: proportions are computed within each gender, not as a share of all users
qplot(x = friend_count, y = ..density../sum(..density..),
data = subset(pf, !is.na(gender)),
xlab = 'Friend Count',
ylab = 'Proportions of users with that friend count',
binwidth = 10, geom = 'freqpoly', color = gender) +
scale_x_continuous(lim = c(0, 1000), breaks = seq(0, 1000, 50) )
## Warning: Removed 2949 rows containing non-finite values (stat_bin).
## Warning: Removed 4 rows containing missing values (geom_path).
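The difference between a share of all users and a share within each gender can be illustrated in base R with `prop.table()`. This is only an analogue on invented data. It is not a reproduction of what ggplot2 computes internally for `..count..` or `..density..`.

```r
# Hypothetical users: gender and the friend-count bin they fall into
gender <- c("female", "female", "male", "male", "male", "male")
bin    <- c("0-9", "10-19", "0-9", "0-9", "0-9", "10-19")

counts <- table(gender, bin)

# Share of ALL users in each gender/bin cell: the larger group dominates
share_of_total <- prop.table(counts)

# Share WITHIN each gender: every row sums to 1, so groups are comparable
share_within_gender <- prop.table(counts, margin = 1)

print(share_of_total)
print(share_within_gender)
```

When the groups differ in size, as they do here (40254 women against 58574 men), within-group proportions are what make the two curves comparable.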
Quiz:
#Quick overview of the data
summary(pf$www_likes)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 49.96 7.00 14860.00
#More detailed percentile distribution
quantile(pf$www_likes, probs = seq(0, 1, length.out = 101), type = 5)
## 0% 1% 2% 3% 4% 5% 6% 7%
## 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## 8% 9% 10% 11% 12% 13% 14% 15%
## 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## 16% 17% 18% 19% 20% 21% 22% 23%
## 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## 24% 25% 26% 27% 28% 29% 30% 31%
## 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## 32% 33% 34% 35% 36% 37% 38% 39%
## 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## 40% 41% 42% 43% 44% 45% 46% 47%
## 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## 48% 49% 50% 51% 52% 53% 54% 55%
## 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## 56% 57% 58% 59% 60% 61% 62% 63%
## 0.00 0.00 0.00 0.00 0.00 0.00 1.00 1.00
## 64% 65% 66% 67% 68% 69% 70% 71%
## 1.00 1.00 1.00 2.00 2.00 2.00 3.00 3.00
## 72% 73% 74% 75% 76% 77% 78% 79%
## 4.00 5.00 6.00 7.00 8.00 9.00 11.00 12.00
## 80% 81% 82% 83% 84% 85% 86% 87%
## 14.00 17.00 19.00 23.00 27.00 31.00 36.00 42.00
## 88% 89% 90% 91% 92% 93% 94% 95%
## 50.00 60.00 72.00 86.00 104.00 128.00 160.00 208.00
## 96% 97% 98% 99% 100%
## 276.00 378.00 568.00 1001.47 14865.00
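One way to use such a percentile table is to pick a plot limit from it, so that a handful of extreme users do not stretch the axis. The sketch below does this on an invented vector (`likes` is hypothetical, not `pf$www_likes`) and uses the 95th percentile as an illustrative cutoff.

```r
# Hypothetical long-tailed data: many zeros, a range of small values, one extreme user
likes <- c(rep(0, 60), 1:39, 5000)

# 95th percentile, same quantile type as in the call above
upper_limit <- quantile(likes, probs = 0.95, type = 5)

# Share of users that would still be visible with this limit
share_visible <- mean(likes <= upper_limit)

print(upper_limit)
print(share_visible)
```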
qplot(x = www_likes, y = ..density../sum(..density..),
data = subset(pf, !is.na(gender)),
xlab = 'Likes',
ylab = 'Proportions of users with that many likes',
binwidth = 10, geom = 'freqpoly', color = gender) +
scale_x_continuous(lim = c(1, 208), breaks = seq(0, 208, 10) )
## Warning: Removed 65873 rows containing non-finite values (stat_bin).
## Warning: Removed 8 rows containing missing values (geom_path).
#Solution from video, equivalent ggplot syntax
ggplot(aes(x = www_likes), data = subset(pf, !is.na(gender))) +
geom_freqpoly(aes(color = gender)) +
scale_x_log10()
## Warning: Transformation introduced infinite values in continuous x-axis
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 60935 rows containing non-finite values (stat_bin).
Notes:
sum(subset(pf, gender == 'male')$www_likes)
## [1] 1430175
sum(subset(pf, gender == 'female')$www_likes)
## [1] 3507665
#Alternate solution from video
by(pf$www_likes, pf$gender, sum)
## pf$gender: female
## [1] 3507665
## --------------------------------------------------------
## pf$gender: male
## [1] 1430175
Notes:
qplot( x = gender, y = friend_count,
data = subset(pf, !is.na(gender) & friend_count < 1000 ),
geom = 'boxplot' )
#Alternate solution 1 from video
qplot( x = gender, y = friend_count,
data = subset(pf, !is.na(gender)),
geom = 'boxplot',
ylim = c(0,1000) )
## Warning: Removed 2949 rows containing non-finite values (stat_boxplot).
#Alternate solution 2 from video
qplot(x = gender, y = friend_count,
data = subset(pf, !is.na(gender)),
geom = 'boxplot') +
scale_y_continuous(limits = c(0, 1000))
## Warning: Removed 2949 rows containing non-finite values (stat_boxplot).
#Alternate solution 3 from video: most accurate (does not remove data points)
qplot(x = gender, y = friend_count,
data = subset(pf, !is.na(gender)),
geom = 'boxplot') +
coord_cartesian(ylim = c(0, 1000))
#See above
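Why the last variant is the most accurate: limits set through `ylim` or `scale_y_continuous()` drop the out-of-range rows before the box plot statistics are computed (hence the "Removed 2949 rows" warnings), while `coord_cartesian()` only zooms the view. The base R sketch below shows the effect of dropping rows on a quartile. The data is invented, and `quantile()` is used as a stand-in for the box plot's own statistics.

```r
# Hypothetical friend counts with a long upper tail
friend_count <- c(10, 50, 100, 400, 900, 1500, 3000)

# Third quartile on all data: what a zoomed (coord_cartesian) view is based on
q3_all <- unname(quantile(friend_count, 0.75))

# Third quartile after dropping values above 1000: what limits-based filtering does
q3_filtered <- unname(quantile(friend_count[friend_count <= 1000], 0.75))

print(c(q3_all = q3_all, q3_filtered = q3_filtered))
```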
Notes:
qplot(x = gender, y = friend_count,
data = subset(pf, !is.na(gender)),
geom = 'boxplot') +
coord_cartesian(ylim = c(0, 250))
by(pf$friend_count, pf$gender, summary)
## pf$gender: female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 37 96 242 244 4923
## --------------------------------------------------------
## pf$gender: male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 27 74 165 182 4917
Response:
by(pf$friendships_initiated, pf$gender, summary)
## pf$gender: female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 19.0 49.0 113.9 124.8 3654.0
## --------------------------------------------------------
## pf$gender: male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 15.0 44.0 103.1 111.0 4144.0
by(pf$friendships_initiated, pf$gender, mean)
## pf$gender: female
## [1] 113.8991
## --------------------------------------------------------
## pf$gender: male
## [1] 103.0666
Response:
qplot(x = gender, y = friendships_initiated,
data = subset(pf, !is.na(gender)),
geom = 'boxplot') +
coord_cartesian(ylim = c(0, 150))
Response: I found out which gender initiates the most friendships on average by running by() on friendships_initiated and gender: women, with a mean of 113.9 against 103.1 for men. I also looked at the median and the quartiles. For both men and women the mean is much closer to the 3rd quartile than to the median. This seems to be due to a few extreme users who have sent out a very large number of friend requests.
Notes:
summary(pf$mobile_likes)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 4.0 106.1 46.0 25110.0
#You often want to convert features with a lot of 0 values to a binary value (True/False)
#Logical variable
summary(pf$mobile_likes > 0)
## Mode FALSE TRUE NA's
## logical 35056 63947 0
pf$mobile_check_in <- NA
pf$mobile_check_in <- ifelse(pf$mobile_likes > 0, 1, 0)
#Making into categorical type
pf$mobile_check_in <- factor(pf$mobile_check_in)
summary(pf$mobile_check_in)
## 0 1
## 35056 63947
#Calculating the proportion of users who have checked in on mobile
summary(pf$mobile_check_in)[2] / nrow(pf)
## 1
## 0.6459097
#Solution from video
sum(pf$mobile_check_in == 1) / length(pf$mobile_check_in)
## [1] 0.6459097
Response: 64.59%
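As a side note, the same proportion can be obtained without creating a factor first, because R treats `TRUE` as 1 and `FALSE` as 0, so the mean of a logical vector is the share of `TRUE` values. A minimal sketch on invented data (`mobile_likes` here is a hypothetical vector, not the `pf` column):

```r
# Hypothetical mobile like counts for five users
mobile_likes <- c(0, 3, 0, 12, 7)

# Logical vector: has the user ever liked something from mobile?
has_mobile_activity <- mobile_likes > 0

# The mean of a logical vector is the proportion of TRUE values
share_mobile <- mean(has_mobile_activity)

print(share_mobile)
```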
Reflection: I learned more R syntax. I hadn’t really used box plots before, so that was useful. Frequency polygons were also new to me, and I liked learning about them. In general I got a refresher on different ways of approaching (mostly exploratory) data analysis.